Defining and Mining Functional Dependencies in
Probabilistic Databases



Sushovan De 

Deptt. of Computer Science and Engineering 
Arizona State University 

sushovan@asu.edu 



Subbarao Kambhampati 

Deptt. of Computer Science and Engineering 
Arizona State University 

rao@asu.edu 






ABSTRACT 

Functional dependencies (traditional, approximate and
conditional) are of critical importance in relational databases,
as they inform us about the relationships between attributes.
They are useful in schema normalization, data rectification 
and source selection. Most of these were however devel- 
oped in the context of deterministic data. Although uncer- 
tain databases have started receiving attention, these de- 
pendencies have not been defined for them, nor are fast al- 
gorithms available to evaluate their confidences. This paper 
defines the logical extensions of various forms of functional 
dependencies for probabilistic databases and explores the 
connections between them. We propose a pruning-based 
exact algorithm to evaluate the confidence of functional de- 
pendencies, a Monte-Carlo based algorithm to evaluate the 
confidence of approximate functional dependencies and al- 
gorithms for their conditional counterparts in probabilistic 
databases. Experiments are performed on both synthetic 
and real data evaluating the performance of these algorithms 
in assessing the confidence of dependencies and mining them 
from data. We believe that having these dependencies and 
algorithms available for probabilistic databases will drive 
adoption of probabilistic data storage in the industry. 

1. INTRODUCTION 

A lot of data generated today, especially that obtained
from the web, is dirty, untrustworthy or uncertain. Yet
we continue to store it in database engines that are
ill-equipped to handle uncertainty. Handling uncertainty is not
a simple case of adding a 'probability' attribute: uncertain
data has correlations and causations, and query processing on
such data is a probabilistic inference problem. It should be
a top priority for the database community to remove the
barriers that prevent data engineers from adopting probabilistic
databases. Data obtained from the web becomes uncertain
for a variety of reasons, including the hardness of schema
mapping, record linkage and the presence of untrustworthy or
dirty data. Traditionally, these problems have been tackled
by taking the data one tuple at a time, and choosing the best
possible alternative for it. Once the alternative is decided,
the data is considered to be completely correct for the
purpose of further analysis. However, probabilistic databases
allow the data to fully reflect the different alternatives that
were available. Further processing can then take into
account all the alternatives, not just the most likely one.

One of the barriers to using probabilistic databases is
the lack of well-defined dependencies between attributes.
In the case of traditional databases, such dependencies, in
the form of exact and approximate functional dependencies,
are used for both fast query processing and rectification of
data. There are algorithms that evaluate functional
dependencies between attributes to aid in schema normalization
[3], and algorithms that evaluate approximate functional
dependencies (AFDs), which help in filling in missing data in
incomplete databases [16]. There are also conditional
functional dependencies (CFDs), which help in cleaning and
correcting data [4]. However, such dependencies and algorithms
are missing for probabilistic data.

In this paper we extend these very useful dependencies 
so that they work with probabilistic databases in general. 
We generalize FDs to probabilistic functional dependencies 
(pFD), AFDs to probabilistic approximate functional de- 
pendencies (pAFD), and their conditional counterparts re- 
spectively to CpFD and CpAFD. We also investigate the 
relationship between these dependencies. In particular, we 
point out which of these dependencies are generalizations of, 
and hence subsume, others. We also provide fast algorithms
for evaluating the confidence of these dependencies on
probabilistic databases, with a special focus on tuple-independent
and tuple-disjoint independent databases [7]. With the help
of these algorithms, we describe how we can mine these
dependencies from data, by using efficient methods to prune
the search space of dependencies.

Motivating example: Assume that two or more
astronomers are observing and recording various objects in
the sky, as in the Sloan Digital Sky Survey [1]. They note
various attributes of the objects, including the color, type,
speed, frequency of oscillation, and position. Each observer
may have his/her doubts about the data, so they may choose
to enter alternatives as options in a probabilistic database.
Such data is most naturally represented as a probabilistic
database, where each tuple represents a different object in
the sky and reflects the curator's confidence in the observer
as well as the observer's confidence in the data. Having
represented this data, the curator can run a pAFD finding
algorithm to discover dependencies that were as yet unknown,
of the form (color, speed ~> type). If the curator is aware of
some dependencies that are expected to hold, he can verify
their validity by running the appropriate dependency checking
algorithm on this data.

One pertinent question is whether these extensions are
important or interesting enough to consider, or whether AFDs
can be used directly. Pending the empirical study (see Section
5.2), we demonstrate their importance with an example. Let
us say that we have a probabilistic database as shown in
Figure 1. We are interested in finding out whether or not the
dependency Color ~> Type has a high confidence. Strictly
speaking, we cannot evaluate its AFD, because the concept
of AFD does not apply to probabilistic data. However, we
can apply what might be called an 'intuitive' extension of
an AFD to probabilistic data by considering each option
as an independent tuple and weighing the options according to
their probabilities. If we do that, Figure 1 shows that this
naive method calculates a confidence (0.5) that is quite low,
and different from the correct value (0.75) dictated by
probabilistic semantics.

Figure 1: Why pAFDs differ from a naive interpretation
of AFD. (top) A tuple-disjoint independent database.
(right) An attempt to find the AFD naively results in a
confidence of 0.5. (left) The semantically correct value
of 0.75.

The rest of the paper is organized as follows. We start
by defining the various dependencies for probabilistic
databases, and in the following section we explore the theoretical
connections between them. In Section 4 we propose some
algorithms to evaluate the confidence of these quantities and
show how to mine them, and in the section following that, we
show experiments that evaluate the effectiveness of these
algorithms. We end with a discussion of related work and
present our conclusions in Section 7.

2. DEFINITIONS 

2.1 Probabilistic Database 

In this paper we follow the possible worlds model for a
probabilistic database: a probabilistic database is a
collection of possible worlds, with each possible world being a
deterministic database with an associated probability.
Figure 1 shows a probabilistic database. The possible worlds
representation is the set of relations on the bottom left.

We denote a deterministic relation by the symbol R, which
has attributes (A_1, A_2, ..., A_n). Sets of attributes are
denoted by X or Y. Uncertain relations are denoted by D, and
each uncertain relation comprises possible worlds
{P_1, P_2, ..., P_m} with associated probabilities and
attributes (A_1, A_2, ..., A_n).

2.2 Probabilistic Functional Dependencies (pFD)

Given a deterministic relation R, a functional dependency
(FD) is defined as (X -> Y), where X and Y are sets of
attributes. The FD is said to hold if, whenever two tuples
share the same values of X, they have the same values of Y.

Figure 2: The relationship between various dependencies.
FDs are the least general; when extended by adding a degree
of truth, we get AFDs; when extended to probabilistic
databases, we get pFDs; when extended to include conditional
dependencies, we get CFDs. Combinations of these properties
give rise to the other dependencies.

We can generalize this idea to probabilistic databases, as
shown in Figure 2 by the 'uncertainty in data' axis. Given an
uncertain relation D, a probabilistic functional dependency,
pFD, is defined as (X -> Y). The pFD is associated with a
quantity called its confidence, which is the fraction of
possible worlds, weighted by their probabilities, in which the
corresponding FD holds.

Consider the possible world representation in Figure 1.
In the first possible world, the data in tuple 1, (Red, Star),
conflicts with tuple 3, (Red, Nebula). As a result the
contribution of that possible world towards the confidence of the
pFD Color -> Type is zero. The last possible world in the
figure shows a non-zero contribution. The FD holds within
the world, hence the entire probability of the world is added
to the pFD confidence score.
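The definition above can be sketched directly over an explicit list of possible worlds. This is only an illustration: the worlds and probabilities below are hypothetical (they are not the contents of Figure 1), and the names `fd_holds` and `pfd_confidence` are assumptions introduced here.

```python
def fd_holds(world, x_idx, y_idx):
    """Return True if the FD X -> Y holds in a deterministic relation."""
    seen = {}
    for row in world:
        x = tuple(row[i] for i in x_idx)
        y = tuple(row[i] for i in y_idx)
        # The first Y seen for an X value becomes the rule for that X value.
        if seen.setdefault(x, y) != y:
            return False
    return True


def pfd_confidence(worlds, x_idx, y_idx):
    """Add up the probabilities of the possible worlds in which the FD holds."""
    return sum(p for p, world in worlds if fd_holds(world, x_idx, y_idx))


# Hypothetical possible worlds over (Color, Type); probabilities sum to 1.
WORLDS = [
    (0.25, [("Red", "Star"), ("Blue", "Planet"), ("Red", "Nebula")]),
    (0.25, [("Red", "Star"), ("Blue", "Planet"), ("Red", "Star")]),
    (0.50, [("Red", "Nebula"), ("Blue", "Planet"), ("Red", "Nebula")]),
]

# Only the first world contains a conflict, so it contributes nothing.
print(pfd_confidence(WORLDS, x_idx=[0], y_idx=[1]))
```

Enumerating worlds explicitly is only practical for very small examples, since the number of worlds grows exponentially with the number of uncertain tuples.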

It should be noted that pFDs suffer from the same kind
of flaws that traditional FDs do. If the data is dirty, just
a few tuples that do not conform to the pFD might cause
an entire possible world to be not counted. We can address
these concerns with pAFDs.

2.3 Probabilistic Approximate Functional 
Dependencies (pAFD) 

AFDs generalize FDs by adding the concept of a 'degree
of truth' to an FD, which is also illustrated in Figure 2.
Given a deterministic relation R, an approximate functional
dependency, AFD, is defined as (X ~> Y), where X and Y
are sets of attributes, with X approximately determining Y.
The confidence of an AFD may be defined in various ways.
Following [9], we define the AFD confidence as one minus
the minimum fraction of tuples that need to be removed
from the relation for the FD to hold.

In an uncertain relation D, a probabilistic approximate
functional dependency, pAFD, is defined as (X ~> Y). The
confidence of the pAFD is the expected confidence of the
AFD over the possible worlds (i.e. the average of the confidence
of the corresponding AFD in each possible world, weighted
by the probability of that world).

For example, in Figure 1, in the first possible world, the
data in tuple 1, (Red, Star), conflicts with tuple 3, (Red,
Nebula). As a result only one of them can be considered as
contributing towards the AFD within that possible world. A
similar argument holds for tuples 2 and 4. We can therefore
say that among the four tuples in the possible world, two
support the AFD, and two don't; hence the confidence of the
AFD is 0.5. In situations where data is both uncertain and
noisy, pAFDs are useful for judging the relationship between
attributes.
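As a rough sketch of these two definitions, the AFD confidence of one world can be computed by keeping, for each value of X, only the most frequent value of Y, and the pAFD confidence is then the probability-weighted average over worlds. The world contents below are hypothetical (the values for tuples 2 and 4 are invented for illustration), and treating an empty world as having confidence 1 is an assumption of this sketch.

```python
from collections import Counter, defaultdict


def afd_confidence(world, x_idx, y_idx):
    """One minus the minimum fraction of tuples to remove so X -> Y holds."""
    if not world:
        return 1.0  # assumption: an empty relation satisfies the FD trivially
    groups = defaultdict(Counter)
    for row in world:
        x = tuple(row[i] for i in x_idx)
        y = tuple(row[i] for i in y_idx)
        groups[x][y] += 1
    # For each X value, keep only the tuples carrying its most common Y value.
    kept = sum(max(counts.values()) for counts in groups.values())
    return kept / len(world)


def pafd_confidence(worlds, x_idx, y_idx):
    """Expected AFD confidence over (probability, world) pairs."""
    return sum(p * afd_confidence(w, x_idx, y_idx) for p, w in worlds)


# Hypothetical world with two conflicting pairs of tuples, as in the text.
CONFLICTING = [("Red", "Star"), ("Blue", "Planet"),
               ("Red", "Nebula"), ("Blue", "Comet")]
CLEAN = [("Red", "Star"), ("Blue", "Planet"),
         ("Red", "Star"), ("Blue", "Planet")]
WORLDS = [(0.5, CONFLICTING), (0.5, CLEAN)]

print(afd_confidence(CONFLICTING, [0], [1]))
print(pafd_confidence(WORLDS, [0], [1]))
```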

2.4 Conditional Probabilistic Functional 
Dependencies (CpFD) 

We can extend functional dependencies in yet another
way: by making them conditional on specific values of the
data. Given a deterministic relation R, a CFD is a pair
(X -> Y, T_p), where X -> Y is a standard FD and T_p is the
pattern tableau, which is a relation with the attributes
(X ∪ Y). The semantics of a CFD [4] are as follows: a CFD
holds on R if the corresponding FD holds on the subset of
tuples that matches the pattern tableau. Every attribute value
in a tuple of T_p is either a constant or the wildcard
character (_). A constant a in T_p matches only the constant a
in R. The wildcard '_' in T_p matches any value in the real
tuple. For a pair of tuples to violate the CFD, they must
agree on every attribute in X but have different values for
some attribute in Y, and their values for the attributes
X ∪ Y must match the pattern tableau.

We can extend this concept to probabilistic databases.
Given an uncertain relation D, a conditional probabilistic
functional dependency, CpFD, is the pair (X -> Y, T_p),
where X and Y are sets of attributes and T_p is the pattern
tableau. The confidence of a CpFD is the fraction of possible
worlds where the corresponding CFD holds.

A CpFD is an extension of the concept of CFD to uncer- 
tain databases, but it does not tolerate dirty data. If even 
one tuple in a possible world violates the CpFD, the entire 
possible world probability is not counted. 

2.5 Conditional Probabilistic Approximate 
Functional Dependencies (CpAFD) 

Given a deterministic relation R, a conditional approximate
functional dependency, CAFD, is defined as the pair
(X ~> Y, T_p). The confidence of the CAFD is one minus the
minimum fraction of tuples that need to be removed from the
subset of tuples that match the pattern tableau T_p such that
the CFD (X -> Y, T_p) holds.

CAFDs support a fractional confidence value, thus they
can be used where the data is expected to be noisy and
when the dependency holds only conditionally. Therefore
it is useful to extend them to probabilistic data. Given an
uncertain relation D, a conditional probabilistic approximate
functional dependency (CpAFD) is the pair (X ~> Y, T_p).
Its confidence is the average of the confidence of
the corresponding CAFD in each possible world, weighted
by the probability of that possible world.

A CpAFD extends the notion of a functional dependency 
in the most general way among all of the dependencies dis- 
cussed in this paper. It supports fractional confidence val- 
ues, possible world semantics, as well as operating on a select 
part of the database. 

3. RELATIONSHIPS AMONG DEPENDENCIES

The dependencies defined in Section 2 are all generalizations
of FDs, as shown in Figure 2. It is natural, therefore,
that we can express the less general dependencies in terms
of the more general ones, and that we can induce
relationships among them. That is what we will attempt to do in
this section.

An FD is an AFD of confidence 1. AFDs allow dependencies
that hold approximately, thus they extend FDs along
the 'degree of truth' axis as shown in Figure 2. It should
be noted that the confidence of an AFD can never be zero.
This is because, in a relation R with at least one tuple, even
if all different values of Y occur with the same values of X,
we can remove all tuples except for one for each different set
of values of X and have a non-zero set of tuples left in our
database. Thus, the confidence of the AFD, which is one
minus the fraction of tuples removed, is non-zero. We can
make a similar comparison between pFDs and pAFDs.

Theorem 1. The confidence of a pAFD is never smaller
than the confidence of the corresponding pFD.

Proof. In every possible world in which the FD holds, the
confidence of the AFD is 1. In every possible world in which
the FD does not hold, the confidence of the AFD is non-zero.
Thus the confidence of the pAFD, which is the weighted sum
of the confidences in each possible world, is at least that of
the pFD, and strictly greater whenever the FD fails in some
possible world of non-zero probability. □

However, the reverse is not true. The information that a
pAFD holds with a very high confidence does not imply that
the pFD will have a high confidence. In fact, the confidence
of the pFD may even be zero, as there might be a few tuples
in every possible world that do not conform to the pFD.

Conditional dependencies extend each of the previous
dependencies so that they can be specified over only a part of
the data. This is illustrated in Figure 2 by the 'conditional'
axis. However, the generalization to conditional dependencies
can introduce inconsistencies. For example, if we know
that an FD holds on a relation, the corresponding CFD is
not guaranteed to hold, since the tableau could introduce
impossible cases [4]. For instance, in a CFD (A -> B), if the
pattern tableau has two tuples, one requiring the value of
B to be b, the other requiring the value to be c, the CFD is
clearly inconsistent, and no relation can satisfy it. However,
in normal use cases, where the tableau of the CFD has been
induced from the data, or else calculated in a non-malicious
manner, intuitively we can state that if the FD holds, the
CFD will also hold. The converse, however, is not true. If
a CFD with a non-trivial tableau holds over a relation R,
then we cannot guarantee that the corresponding FD holds.

A pattern tableau may select a non-zero number of tuples 
from some possible worlds but no tuples from others. In the 
special case where the tableau selects some tuples from each 
possible world, we can state that the confidence of the CpFD 
and the confidence of a CFD are equal. More generally, we 
can state the following theorem: 

Theorem 2. If a pFD (X -> Y) holds over a probabilistic
relation D with confidence p, then the CpFD (X -> Y, T_p)
also holds over D with a confidence not less than p, provided
the CpFD is not inconsistent.

Proof. Suppose the pFD holds on a certain possible
world of D. The pattern tableau would then cause certain
tuples to be eliminated from consideration. Even after
elimination of a few tuples, the pFD will continue to hold, by
definition. Thus that possible world will contribute towards
the confidence of the CpFD. On the other hand, if there is
a possible world in which the pFD does not hold, it is
possible that those tuples that cause the pFD not to hold will
be eliminated by the pattern tableau, causing the possible
world to contribute to the confidence of the CpFD. So the
fraction of possible worlds contributing to the CpFD is not
smaller than the fraction contributing to the confidence of the
pFD. □

The same argument does not hold for pAFDs and CpAFDs.
In the case of CpAFDs, elimination of certain tuples may
reduce the confidence. For example, consider the CpAFD
(A ~> {B, C}, T_1), where T_1 consists of the tuple (_, b_2, _).
If one of the possible worlds, P, of relation D contains 50
tuples of the form (a_1, b_1, c_1) and 50 tuples of the form
(a_1, b_2, c_2), (a_1, b_2, c_3), ..., (a_1, b_2, c_51), then in
that world the pAFD would have confidence 0.50, but the CpAFD
would have confidence 0.02 (after eliminating the 50 tuples not
matching T_1).
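The arithmetic of this counterexample can be checked with a short script. It is a sketch under the AFD confidence definition of Section 2.3; the helper names and the string encoding of the attribute values are assumptions made for illustration.

```python
from collections import Counter, defaultdict


def afd_confidence(world, x_idx, y_idx):
    """One minus the minimum fraction of tuples to remove so X -> Y holds."""
    groups = defaultdict(Counter)
    for row in world:
        groups[tuple(row[i] for i in x_idx)][tuple(row[i] for i in y_idx)] += 1
    return sum(max(c.values()) for c in groups.values()) / len(world)


def matches(row, pattern):
    """A row matches a pattern tuple if every constant agrees ('_' is a wildcard)."""
    return all(p == "_" or p == v for v, p in zip(row, pattern))


# World P: 50 copies of (a1, b1, c1), then (a1, b2, c2) ... (a1, b2, c51).
P = [("a1", "b1", "c1")] * 50 + [("a1", "b2", f"c{i}") for i in range(2, 52)]
T1 = ("_", "b2", "_")

unconditional = afd_confidence(P, x_idx=[0], y_idx=[1, 2])
selected = [row for row in P if matches(row, T1)]
conditional = afd_confidence(selected, x_idx=[0], y_idx=[1, 2])

print(unconditional, conditional)
```

Without the tableau, the 50 identical tuples form a consistent majority; once they are filtered out, every remaining tuple disagrees with every other one, so only one of the 50 can be kept.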

4. ASSESSING AND MINING PROBABILISTIC DEPENDENCIES

We consider two problems in the presented framework: 

Evaluating confidence: Given a relation R and a
specified dependency, find the confidence of the dependency.

Mining dependencies: Given a relation R, find a
minimal set of dependencies that is equivalent to or more
general¹ than any set of dependencies that holds over R with a
confidence higher than a given threshold.

In the following sections we focus on evaluating the
confidence of the dependencies, since it is the first step towards
mining them. Once we have fast algorithms for evaluating
the confidence, we then use the methods in [16] to prune the
space of dependencies we have to search through in order to
mine them.

While we are interested in computing the confidence for any
probabilistic database, we shall see that in the very general
case, evaluation is exponential. So we also consider special
cases, mainly focusing on tuple-disjoint independent (TDI)
databases [7], which are a popular special case of a
probabilistic database, in which tuples with distinct keys are
independent. We can think of this as a set of uncertain
relations, where each tuple has a set of "options", with each
option having a probability. The decision about which option
to pick for each tuple is taken independently. This
significantly reduces the types of uncertainty and correlations that
can occur among the tuples, but also makes many operations
on the database tractable. These algorithms also work for
tuple-independent (TI) databases, where each tuple has an
existential probability, but does not have any options. The
straightforward adaptation of these algorithms to TI
databases is explained in Appendix 9.1.



4.1 Assessing the confidence of a pFD 

Figure 3: The algorithm for computing the confidence
of a pFD. The dotted circles represent adding
the probabilities from the child branches; the solid
circles represent multiplication of the probabilities of
the child and the parent.

In a generalized probabilistic database, represented by its
possible worlds, finding the confidence of a pFD would be
polynomial in the combined size of the possible worlds, which is
typically exponential in the number of entities it represents.
For a more compact representation of a probabilistic database
like TDI or TI, a naive evaluation of the confidence of a pFD
would likely take time exponential in the number of tuples,
since we would have to effectively generate the possible worlds
and add up the probability of those in which the corresponding
FD holds. Ordinarily, we would use Monte Carlo to sample the
possible worlds (as we will do later to find the confidence of
pAFDs). However, that approach does not work very well for
pFDs, since a single option that violates the dependency can
bring the contribution of the entire sample to zero.
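The naive evaluation described above can be sketched as follows, assuming a TDI database is stored as a list of tuples, each tuple being a list of (probability, x, y) options whose probabilities sum to 1. This representation, the single-attribute X and Y, and the example data are assumptions made for illustration only.

```python
from itertools import product
from math import prod


def fd_holds(rows):
    """Return True if x -> y holds over a list of (x, y) pairs."""
    seen = {}
    return all(seen.setdefault(x, y) == y for x, y in rows)


def naive_pfd_confidence(db):
    """Enumerate every possible world of a TDI database (exponential time)."""
    total = 0.0
    for choice in product(*db):  # one option picked per tuple
        if fd_holds([(x, y) for _, x, y in choice]):
            # Tuples are independent, so the world probability is a product.
            total += prod(p for p, _, _ in choice)
    return total


# Hypothetical TDI database over (Color, Type).
DB = [
    [(0.5, "Red", "Star"), (0.5, "Red", "Nebula")],
    [(1.0, "Red", "Star")],
    [(0.5, "Blue", "Star"), (0.5, "Blue", "Planet")],
]

print(naive_pfd_confidence(DB))
```

With k options per tuple and n tuples the loop visits k^n worlds, which is the blow-up that the pruning-based algorithm is designed to avoid.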

We now present an efficient algorithm that finds the
confidence of a pFD in a tuple-disjoint independent database.
This algorithm is exact and has the property of being
exponential only in the cardinality of the domain of the
attributes, rather than the number of tuples. In practice our
algorithm finds useful probabilistic functional dependencies
very efficiently, as most desirable pFDs have low specificity.

The complexity of the algorithm can be analyzed in terms
of specificity, which is defined in terms of the support of the
association rules that make up the dependency. Specificity is
high when the association rules have a very low support, and
the rule becomes less valuable. For example, a rule with
high specificity such as (Social Security Number -> Color of
hair) will definitely hold, since SSN is a key, but is not very
useful. On the other hand, an FD that states (Zip Code
-> Street Name) for addresses in England is a useful one.
Each zip code appears multiple times in the database and
the dependency gives us useful semantic information even
though zip code is not a key. For a low specificity pFD, the
number of values a particular attribute can take is much less
than the number of tuples, which makes our algorithm run
more efficiently. A formal definition of specificity and its
adaptation to TDI databases is presented in Appendix 9.2.

¹A set of dependencies E_1 is said to be more general than
another set E_2 if every dependency in E_2 can be inferred
from E_1 using Armstrong's axioms and the inference rules
for conditional dependencies in [4].

The algorithm exploits the fact that we are using a
tuple-disjoint independent database. It keeps track of a set of
association rules that comprise the pFD at any point in
the algorithm. We pick the tuples one by one, and treat
them independently. We next optimize the calculation of
the pFD using two pruning criteria. First, if a particular
option does not comply with the current set of rules, then
the entire set of possible worlds that include that option
will contribute zero confidence for the pFD. So we can
terminate that branch right away. Second, if all the options in
the tuple comply with the ruleset, then the confidence of the
pFD does not change whether or not we pick that tuple (its
contribution is 1). We can therefore ignore the tuple.

Algorithm 1: FindPfdRecursive

input : A TDI database D, a pFD P and the current
        set of rules R
output: The confidence of P in D
begin
    if D is empty then
        return 1
    T = ChooseBestRemainingTuple(D, R)
    if EntireTupleIsCompatible(T, R) then
        return FindPfdRecursive(D \ T, R)
    C = 0
    for Options O ∈ T do
        if IsCompatible(O, R) then
            AddRule(O, R)
            C = C + Prob(O) × FindPfdRecursive(D \ T, R)
            RemoveFromRule(O, R)
    return C

We can express the evaluation of this algorithm as a tree
(see Figure 3). The tree branches whenever more than one
option in a tuple is consistent with the current set of rules.
Using the two criteria in the previous paragraph, we can
choose an optimal order in which to pick the tuples so that
the expression tree has the minimum width. We can then
prove that the algorithm is exponential only in the
cardinality of the domain of the attributes.

Once the execution reaches the leaf node, we track back. 
At each stage, we sum up the probability of all the branches 
that originated at that point. Then we multiply the result 
with the probability of the parent to compute the contribu- 
tion of this branch to the confidence of the pFD. 

The algorithm is formally presented as Algorithm 1. The
running time of the algorithm is improved by the
introduction of the function ChooseBestRemainingTuple, which
picks from the remaining tuples those that do not cause
branching in the evaluation tree. There are three kinds of
such tuples: those that completely comply with the
current set of rules (their contribution is 1), those that have
no options that comply with the current set of rules (the
branch is immediately terminated), and those that have only
one option that complies with the current set of rules (there
is no branching).
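A minimal Python sketch of this recursion is given below. It is an interpretation of Algorithm 1 rather than a reference implementation: it assumes single-attribute X and Y, options stored as (probability, x, y) with probabilities summing to 1 per tuple, and a simple greedy stand-in for ChooseBestRemainingTuple that picks the tuple with the fewest options left to explore. All function names are hypothetical.

```python
def implied(option, rules):
    """The option's (x, y) pair is already one of the current rules."""
    _, x, y = option
    return x in rules and rules[x] == y


def compatible(option, rules):
    """The option does not contradict any current rule."""
    _, x, y = option
    return x not in rules or rules[x] == y


def branching_cost(options, rules):
    """0 for tuples that can be skipped or that end the branch; otherwise
    the number of options that would have to be explored."""
    if all(implied(o, rules) for o in options):
        return 0
    return sum(1 for o in options if compatible(o, rules))


def find_pfd_recursive(db, rules=None):
    rules = {} if rules is None else rules
    if not db:
        return 1.0  # no tuples left: this branch is fully consistent
    # Greedy stand-in for ChooseBestRemainingTuple.
    i = min(range(len(db)), key=lambda k: branching_cost(db[k], rules))
    options, rest = db[i], db[:i] + db[i + 1:]
    if all(implied(o, rules) for o in options):
        return find_pfd_recursive(rest, rules)  # contribution is 1
    confidence = 0.0
    for option in options:
        if not compatible(option, rules):
            continue  # prune: every world with this option violates the FD
        p, x, y = option
        is_new = x not in rules
        if is_new:
            rules[x] = y
        confidence += p * find_pfd_recursive(rest, rules)
        if is_new:
            del rules[x]  # undo the rule before trying the next option
    return confidence


# Hypothetical TDI database over (Color, Type).
DB = [
    [(0.5, "Red", "Star"), (0.5, "Red", "Nebula")],
    [(1.0, "Red", "Star")],
    [(0.5, "Blue", "Star"), (0.5, "Blue", "Planet")],
]

print(find_pfd_recursive(DB))
```

In this example the certain tuple is chosen first, which fixes the rule Red -> Star and leaves only one viable option for the first tuple, so the Red tuples never cause branching.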

4.2 Assessing the confidence of a pAFD 

It was possible for us to considerably speed up the calcu- 
lation of the confidence of a pFD because as soon as an exe- 
cution branch violated the association rules for that branch, 
we were able to terminate it. However, this technique does 
not work for calculating the confidence of a pAFD. During 
the execution of a similar algorithm for pAFD, if a tuple 
violates the majority association rule, it does not terminate 
the branch - it merely reduces the confidence within that 
possible world. In addition to this, the association rules 
themselves might change once enough counterexamples are 
observed. As a result the algorithm becomes exponential 
time for a pAFD. We can, however, attempt to calculate 
the confidence of a pAFD approximately. 



Deterministic approximation: One of the ways in 
which a pAFD may be approximated is to first create a 
deterministic database by completely ignoring the proba- 
bilities and treating every uncertain option as a tuple of a 
deterministic database. Obviously, this will violate the se- 
mantics of disjoint possible worlds. We then find the AFD 
over this deterministic database, and report the confidence 
of that AFD as the confidence of the pAFD. 

Unioned approximation: A better approximation would
be to create a tuple-independent (TI) database by taking
every uncertain option in the database along with its
probability and making it a tuple of the TI database. This
causes us to lose all the correlation between the options
of the same tuple in a TDI database, and we treat them
as independent. Since this is effectively creating a union
of all the options, we call this approach the unioned
approximation to finding the confidence of a pAFD. We then
find the pAFD over this tuple-independent database.
Having the tuples completely independent of each other makes
finding the confidence much easier. To find the confidence
of the dependency (X ~> Y), we can find all the association
rules (x, y) in the database. For each distinct value x
of X, we find the value y_max of Y which has the maximum
support in the database. The pAFD value is given by
$\sum_x \mathrm{support}(x, y_{\max}) / \sum_x \mathrm{support}(x)$.
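The unioned approximation can be sketched in a few lines. The sketch assumes that the support of an association rule (x, y) is the sum of the probabilities of the options carrying that pair, that X and Y are single attributes, and that options are stored as (probability, x, y); these are illustrative assumptions and the function name is hypothetical.

```python
from collections import defaultdict


def unioned_pafd(db):
    """Approximate pAFD confidence, treating every option as an independent tuple."""
    support = defaultdict(lambda: defaultdict(float))
    for options in db:
        for p, x, y in options:
            support[x][y] += p  # assumed: support = total probability mass
    best = sum(max(ys.values()) for ys in support.values())
    total = sum(sum(ys.values()) for ys in support.values())
    return best / total if total else 1.0


# Hypothetical TDI database over (Color, Type).
DB = [
    [(0.5, "Red", "Star"), (0.5, "Red", "Nebula")],
    [(1.0, "Red", "Star")],
    [(0.5, "Blue", "Star"), (0.5, "Blue", "Planet")],
]

print(unioned_pafd(DB))
```

For this database the estimate is 2/3, whereas averaging the AFD confidence over the four possible worlds gives 5/6. The gap comes from the third tuple: its two options never co-occur in a world, but the union treats them as conflicting tuples.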

Monte Carlo: In order to maintain intra-tuple correla- 
tions, we can sample a subset of the possible worlds, and 
compute the confidence of the pAFD over that subset. We 
can then scale it up appropriately to find the confidence of 
the pAFD over the entire relation. It is clear that the more 
representative a subset we sample, the more accurate our 
pAFD computation will be. To choose a well-represented 
subset of possible worlds, we use a Monte Carlo simulation. 

It is worth noting that the Monte Carlo technique does
not make any assumptions about the probabilistic
database under consideration; specifically, it is not restricted to
TDI databases. For the case of the TDI database, we take
one tuple at a time. For every tuple, we generate a
random number to choose which option is to be picked. This
is done in proportion to the probability of the option. Since
by definition, different tuples are independent of each other,
this results in generating a Monte Carlo sample of the TDI
database. We find the AFD confidence of the sampled
possible world, and weigh it with the probability of the possible
world. We repeat this process till the weighted average
of the AFD values observed converges. That is reported as
the confidence of the pAFD.
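The sampling procedure for a TDI database can be sketched as below. Several choices are assumptions of this sketch rather than details taken from the text: a fixed number of samples replaces the convergence test, the estimate is a plain mean of the sampled AFD confidences (on the view that drawing options in proportion to their probability already accounts for the weight of each world), and options are stored as (probability, x, y) with single-attribute X and Y.

```python
import random
from collections import Counter, defaultdict


def afd_confidence(rows):
    """AFD confidence of x -> y over a list of (x, y) pairs."""
    groups = defaultdict(Counter)
    for x, y in rows:
        groups[x][y] += 1
    return sum(max(c.values()) for c in groups.values()) / len(rows)


def sample_world(db, rng):
    """Pick one option per tuple, in proportion to the option probabilities."""
    world = []
    for options in db:
        weights = [p for p, _, _ in options]
        _, x, y = rng.choices(options, weights=weights, k=1)[0]
        world.append((x, y))
    return world


def monte_carlo_pafd(db, samples=20000, seed=0):
    """Estimate the pAFD confidence as the mean AFD confidence of sampled worlds."""
    rng = random.Random(seed)  # seeded only to make the sketch reproducible
    total = sum(afd_confidence(sample_world(db, rng)) for _ in range(samples))
    return total / samples


# Hypothetical TDI database; its exact pAFD confidence is 0.75.
DB = [
    [(0.5, "Red", "Star"), (0.5, "Red", "Nebula")],
    [(1.0, "Red", "Star")],
]

print(monte_carlo_pafd(DB))
```

In practice the loop would stop once the running mean stabilizes, as described above, rather than after a fixed number of samples.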

4.3 Assessing the confidence of a CpFD 

Our strategy to find the confidence of the conditional de- 
pendencies - the CpFD and the CpAFD - is a simple one. 
We first select the tuples from the database that match the 
pattern tableau, and then we run our corresponding pFD or 
pAFD on the resulting relation. 
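This select-then-evaluate strategy can be sketched as follows for a CpFD. The sketch makes several simplifying assumptions: X and Y are single attributes, an option that matches no pattern row on X is dropped, an option that matches a row on X but contradicts a constant on Y counts as a violation, and the final confidence is computed by brute-force enumeration instead of the pruning algorithm of Section 4.1. The data and all names are hypothetical.

```python
from itertools import product
from math import prod

WILDCARD = "_"
ABSENT, VIOLATION = "absent", "violation"


def matches(value, pattern):
    """A constant matches only itself; the wildcard matches anything."""
    return pattern == WILDCARD or value == pattern


def mark_option(x, y, tableau):
    """Classify one option against the pattern rows (px, py)."""
    hits = [(px, py) for px, py in tableau if matches(x, px)]
    if not hits:
        return ABSENT      # not selected by the tableau: ignore it
    if any(not matches(y, py) for _, py in hits):
        return VIOLATION   # selected, but contradicts a constant on Y
    return (x, y)


def cpfd_confidence(db, tableau):
    """Brute-force CpFD confidence over all possible worlds of a TDI database."""
    total = 0.0
    for choice in product(*db):
        marked = [mark_option(x, y, tableau) for _, x, y in choice]
        if VIOLATION in marked:
            continue
        rules, holds = {}, True
        for m in marked:
            if m == ABSENT:
                continue
            x, y = m
            if rules.setdefault(x, y) != y:
                holds = False
                break
        if holds:
            total += prod(p for p, _, _ in choice)
    return total


# Hypothetical TDI database over (Color, Type).
DB = [
    [(1.0, "Red", "Star")],
    [(0.5, "Red", "Nebula"), (0.5, "Blue", "Nebula")],
    [(1.0, "Blue", "Planet")],
]
TABLEAU = [("Red", "Star")]

print(cpfd_confidence(DB, TABLEAU))
```

Only the worlds in which the second tuple takes its Blue option survive, since its Red option contradicts the constant Star required by the tableau.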

As Dalvi and Suciu show in [6], query processing on a
probabilistic database can be a #P-hard problem. However,
there are certain queries that are guaranteed to have
safe plans, which can be evaluated in polynomial time.
Fortunately, selecting the tuples matching a pattern tableau
is a safe query, and can be efficiently evaluated using the
algorithms in [6].

We follow Bohannon et. al. [4] for the query to find tuples 
that do not match the pattern tableau, appropriately mod- 
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Figure 4: Comparison between the time taken by 
the naive algorithm and the proposed algorithm 
for the confidence of a pFD on a log-scale. 
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Figure 5: Comparison of time taken for the al- 
gorithms for various dependencies to run vs the 
number of tuples. 



ified for a probabilistic relation. Given a probabilistic rela- 
tion R with attributes Ai,...An, and a CpFD {X -* Y,Tp), 
we can use the following query to find the probabilistic op- 
tions of the tuples that do not match the pattern tableau Tp 
and hence can be removed from consideration: 



5. EXPERIMENTAL EVALUATION 

We empirically verify our algorithms using two sets of 
data - one generated synthetically, and the other real data 
extracted from DBLP. 



where NOT(t[Xi] x tp[Xi] AND...ANDf[X,i] x tp[X„]) 

Here, t[X] x ip[^] represents the condition that either 
t[X] = tp[X] or ip[X] = _. We replace all these tuples with 
the special symbol @. Then we run the following query to 
find all tuples that violate the pattern tableau: 

select t from Rt, Tp tp 

where t[Xi] x fp[Xi] AND...ANDi[X„] xtp[X„]AND 
{t[Yi] t tp[Yi] OR... ORi[Y„] t tp[Yn]) 

Here t[Y] t tp[Y] represents the condition that both t[Y] + 
ip[F] and tp[i^] * _. We replace these options with the 
special symbol (j> to denote that it violates the tableau. The 
main diflterence from H in finding mismatches is that they 
found entire tuples, but here we find options of tuples. 

We now run our algorithm of section [4T] on this modified 
relation. Whenever we encounter the @ symbol, we treat it 
as if the tuple does not exist. Whenever we encounter the (p 
symbol, we treat it as if it violates the current set of rules. 
The resulting confidence is the confidence of the CpFD. 

We illustrate this process with an example. Consider
the database of Figure 1 and the CpFD (Color → Type, T_1),
where T_1 is the single tuple (Red, Star). Any option that
does not have Red for the Color attribute matches the first
query and is replaced with @. The only two remaining options
are (Sirius, Red, Star) and (Taurus, Red, Nebula). Since
Nebula ≠ Star, the (Taurus, Red, Nebula) option matches the
second query, and thus it is replaced with φ. We then run
the pFD algorithm and obtain the confidence 0.5.
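The marking step described above can be sketched in Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the names `agrees`, `mark_option`, `mark_relation`, the string `"phi"` standing in for φ, and the toy relation are all hypothetical (the toy data is not the contents of Figure 1). The marked relation would then be passed to the pFD algorithm of Section 4.1, which is not reproduced here.

```python
ABSENT = "@"       # option matches no pattern row: treat the tuple as missing
VIOLATION = "phi"  # option matches a row on X but contradicts its Y constant
WILDCARD = "_"


def agrees(value, pattern_value):
    """The match test: equal values, or a wildcard in the pattern."""
    return pattern_value == WILDCARD or value == pattern_value


def mark_option(option, lhs, rhs, tableau):
    """Apply both queries to one option (a dict from attribute to value)."""
    matched = False
    for row in tableau:
        if all(agrees(option[a], row[a]) for a in lhs):
            matched = True
            # A mismatch on Y means the values differ and the pattern is not "_".
            if any(not agrees(option[a], row[a]) for a in rhs):
                return VIOLATION
    return option if matched else ABSENT


def mark_relation(db, lhs, rhs, tableau):
    """db: list of tuples, each a list of (option, probability) pairs."""
    return [[(mark_option(opt, lhs, rhs, tableau), p) for opt, p in tup]
            for tup in db]


# Hypothetical toy data, chosen only to exercise the three outcomes.
toy_db = [
    [({"Name": "Sirius", "Color": "Red", "Type": "Star"}, 0.5),
     ({"Name": "Sirius", "Color": "Blue", "Type": "Star"}, 0.5)],
    [({"Name": "Taurus", "Color": "Red", "Type": "Nebula"}, 1.0)],
]
tableau = [{"Color": "Red", "Type": "Star"}]
marked = mark_relation(toy_db, ["Color"], ["Type"], tableau)
```

In this sketch an option that matches no tableau row becomes `@`, an option that matches a row on X but conflicts with a constant on Y becomes `phi`, and every other option is left unchanged.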

4.4 Assessing the confidence of a CpAFD 

We follow the same principle as Section 4.3. However, in
this case we use the Monte Carlo algorithm of Section 4.2
instead of the algorithm for pFD to evaluate the confidence
over the resulting relation.



5.1 Synthetic Data 

Data: We are using the tuple-disjoint independent model. 
We generate synthetic data by creating a TDI database with 
n tuples, each tuple having 2 options, and the options gen- 
erated from the domain of the attributes of cardinality m. 
Each option may choose not to follow the specified FD with 
the probability called noise. 
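A generator along these lines could be sketched as follows. The text does not say how option probabilities or the underlying FD were chosen, so this sketch makes its own assumptions: the FD is a random mapping from X values to Y values, each tuple's two options receive probabilities p and 1 - p with p drawn uniformly, and the function name `generate_tdi` and its parameters are hypothetical.

```python
import random


def generate_tdi(n, m, noise, seed=0):
    """Return a TDI database: n tuples, each a list of two ((x, y), prob) options.

    Attribute values are integers in range(m). With probability `noise`
    an option picks a y that disagrees with the hidden FD x -> y.
    """
    if noise > 0 and m < 2:
        raise ValueError("noise needs at least two values in the domain")
    rng = random.Random(seed)
    fd = {x: rng.randrange(m) for x in range(m)}  # assumed hidden mapping
    db = []
    for _ in range(n):
        p = rng.random()  # assumed: option probabilities are p and 1 - p
        options = []
        for prob in (p, 1.0 - p):
            x = rng.randrange(m)
            if rng.random() < noise:
                y = rng.choice([v for v in range(m) if v != fd[x]])
            else:
                y = fd[x]
            options.append(((x, y), prob))
        db.append(options)
    return db
```

With `noise=0` every option follows the hidden mapping, so the FD holds in every possible world; raising `noise` introduces conflicting options.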

Results: We evaluate the time taken by the pFD algorithm
to run on synthetic data, while varying the number
of tuples in the data. We compare this performance with
the naive algorithm, which would enumerate all the possible
worlds. The results are shown in Figure 4. As can be clearly
seen from the graph, the time taken by the naive algorithm
grows exponentially, and quickly becomes infeasible to run,
even with as few as 30 tuples.
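The naive baseline can be sketched as a direct enumeration of possible worlds. This is an illustrative sketch, assuming each option is an (x, y) pair for a single dependency X → Y; the function name `pfd_confidence_naive` is hypothetical. With two options per tuple, 30 tuples already give 2^30 (about 1.07 billion) worlds, which is why this approach stops being practical so quickly.

```python
from itertools import product


def pfd_confidence_naive(db):
    """db: list of tuples; each tuple is a list of ((x, y), prob) options.

    Sums the probability of every possible world in which no two
    chosen options share an x value but disagree on y.
    """
    confidence = 0.0
    for world in product(*db):  # one option chosen per tuple
        prob = 1.0
        mapping = {}
        holds = True
        for (x, y), p in world:
            prob *= p
            if mapping.setdefault(x, y) != y:
                holds = False
        if holds:
            confidence += prob
    return confidence


example = [
    [(("a", 1), 0.6), (("a", 2), 0.4)],
    [(("a", 1), 0.5), (("b", 3), 0.5)],
]
result = pfd_confidence_naive(example)
```

In the example, only the world that picks ("a", 2) and ("a", 1) together violates the dependency, so the confidence is 1 - 0.4 × 0.5 = 0.8.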

In Figure 5 we compare the time taken by the algorithms
for calculating the confidence of various dependencies.
The cardinality of the domain of the attribute is held
constant, while the number of tuples is increased. The Monte
Carlo approximation algorithms for pAFD and CpAFD take
significantly less time to converge compared to the pFD and
CpFD algorithms. When plotted on a separate graph, it
can be seen that they grow approximately linearly with the
number of tuples. The pFD and CpFD algorithms show a general
increasing trend with the number of tuples. The fluctuation
observed is due to the algorithm quickly finding conflicting
data and terminating early in some cases.

We also assessed the robustness of the dependencies to
noise in the data in Figure 6. We observe that even a slight
introduction of corruption causes the confidences of a pFD
and a CpFD to drop sharply. The confidence of a pAFD does not
fall as sharply, which shows that when the data is likely
to be noisy, pAFD should be used to mine dependencies.
However, for data rectification, or in cases where the data is
guaranteed to be clean, pFD is very useful, since
its confidence value is very sensitive to violations.
Figure 6: Comparison between the average confidence
reported for the dependencies in a database
for varying noise.

5.2 DBLP Data 



We use a set of data modified from [12]. The database
consists of DBLP [13] data, with additional probabilistic
attributes added to it by various information retrieval and
machine learning sources. We use the "Author" relation
from this source, which contains information about
approximately 700,000 computer science authors. This table has
some deterministic attributes such as Name, MinYearOfPublication,
MaxYearOfPublication and NumPublication. It also
has the following uncertain attributes: Institute, Country,
Domain, Region and Subregion. We modify this dataset by
re-indexing it and converting it into a tuple-disjoint
independent format.

In Figure 7 we show the results of running the Monte
Carlo pAFD algorithm on a 200,000-tuple subset of this
dataset to evaluate the confidence of the dependency
Institute ~> Country. We show how the accuracy of the
evaluated confidence varies with the number of Monte Carlo
simulations. We see that with an increase in the number of
simulations, the time taken increases, and the average error
decreases, as expected. We terminate the simulation once
the computed confidence converges. From this graph it is
apparent that it takes only around 100 simulations before
the value stabilizes and the algorithm can be terminated.
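The sample-until-stable loop described here can be sketched as below. Several details are assumptions made for illustration, since Section 4.2 is not reproduced in this part of the paper: the pAFD confidence is taken to be the expected AFD confidence over possible worlds, the AFD confidence of one world is taken to be the fraction of tuples kept when each X value retains only its most frequent Y value, and the stopping rule (running mean moving less than `tol` over `window` samples) is hypothetical.

```python
import random
from collections import Counter, defaultdict


def afd_confidence(world):
    """Fraction of (x, y) tuples kept if each x keeps only its majority y."""
    groups = defaultdict(Counter)
    for x, y in world:
        groups[x][y] += 1
    kept = sum(max(counts.values()) for counts in groups.values())
    return kept / len(world)


def sample_world(db, rng):
    """Pick one option per tuple according to the option probabilities."""
    world = []
    for options in db:
        values = [v for v, _ in options]
        weights = [p for _, p in options]
        world.append(rng.choices(values, weights=weights, k=1)[0])
    return world


def pafd_confidence_mc(db, tol=1e-3, window=20, max_samples=10_000, seed=0):
    """Return (estimate, samples_used) for a non-empty TDI database."""
    rng = random.Random(seed)
    total = 0.0
    estimates = []
    for n in range(1, max_samples + 1):
        total += afd_confidence(sample_world(db, rng))
        estimates.append(total / n)
        # Stop once the running mean has barely moved over the last window.
        if n > window and abs(estimates[-1] - estimates[-1 - window]) < tol:
            break
    return estimates[-1], n
```

Each sample costs one pass over the tuples, which is consistent with the roughly linear growth in running time reported for the Monte Carlo algorithms.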

In Figure 8 we show the confidence values of the pAFDs
mined from this data for all 700,000 tuples using two
different approaches. As shown in Section 4.2, we can
approximate the confidence of a pAFD using three different
approaches - the deterministic, union and Monte Carlo
approximations. This experiment shows that even the union
approximation method gives significant differences in
confidence values. In the context of mining the dependencies
we would typically choose dependencies by placing a threshold
on the confidence values or by taking the top-k mined
dependencies. The difference we observed in the confidence
values is significant enough that the union method would
give different dependencies when mining. We also see that
the union method can both underestimate (e.g. the first
dependency in Figure 8) and overestimate the probabilities
(e.g. the last dependency). Thus it seems that the Monte
Carlo method is most suited to finding the confidence of
pAFDs.



Figure 7: The average error and the time taken 
vs the number of Monte Carlo simulations for a 
200,000 tuple database of DBLP data. 

5.3 Dependency mining 

In order to mine pAFD dependencies, we adapted the
AFDMiner algorithm as described in [16]. We start with
200,000 tuples from the DBLP dataset. We choose four
uncertain attributes: Institute, Country, Region and Subregion.
We build a hierarchy of possible dependencies (the
set-containment lattice of the attributes). Using the AFDMiner
algorithm, we prune our search space using two criteria:
redundancy and specificity. The redundancy condition
prunes those dependencies that are guaranteed to hold since
they are subsumed by the current set of dependencies. The
specificity condition prunes those dependencies that are too
specific to be considered as pAFDs. For each candidate
dependency that is not pruned, we compute the
confidence using the Monte Carlo algorithm. The exact details
of the adaptation are described in Appendix 9.3.

Figures 9 and 10 show dependencies mined from the DBLP
data using two different thresholds for specificity. In the first
case we set a very low specificity threshold, causing us to
discover only those dependencies that are likely to be more
general across the data. In the second case we set a high
specificity threshold value, allowing much more specific
dependencies to be also discovered. As we can see, this results
in more dependencies being discovered, at the cost of quality
of the dependencies.

6. RELATED WORK 

There is a large body of research that talks about associa- 
tion rules [2] and itemsets [5] , more commonly known as the 
market-basket analysis problem. This work on association 
rules was recently improved by Kalavagattu Tl to include 
pruning based on specificity and to roll them up into approx- 
imate functional dependencies. AFDs have also been used 
to mine attribute correlations on autonomous web databases 
by Wolf et al. 
They have also been used by Wang et al.
[15] to find dirty data sources and normalize large mediated 
schemas. FDs have also been generalized into conditional 
functional dependencies. Their role in data cleaning was 
shown by Bohannon et al. [i]. 

Sarma et al. extended FDs to probabilistic data in [14].
However, the dependencies proposed in that paper
are appropriate for schema normalization, but
inappropriate for discovering hidden relationships in data.



Dependency        Monte Carlo   Time (ms)   Union    Time (ms)
Inst ~> Ctry      0.9492        66844       0.8805   916
Ctry ~> Inst      0.0932        60143       0.0977   800
Ctry ~> SubRgn    0.9875        56104       0.9685   500
SubRgn ~> Ctry    0.6954        55507       0.657    611
Ctry ~> Region    0.9821        49451       0.9598   479
Region ~> Ctry    0.6078        51850       0.58     492
Domain ~> Ctry    0.6764        98590       0.6049   1415
Ctry ~> Domain    0.1144        63951       0.116    882

Figure 8: The confidence of pAFDs and the time taken, as
computed by the Monte Carlo method vs the union method.

Dependency               Confidence
Institute ~> Region      0.9752
Country ~> Region        0.9893
Subregion ~> Region      0.9942
Institute ~> Subregion   0.9752
Country ~> Subregion     0.9930

Time taken: 280s

Figure 9: The dependencies discovered in DBLP data by mining
pAFDs for specificity threshold = 0.3.

Dependency                      Confidence
Institute ~> Country            0.9751
Country ~> Region               0.9893
Country ~> Subregion            0.9930
Institute ~> Region             0.9751
Subregion ~> Region             0.9942
Institute ~> Subregion          0.9753
Region, Country ~> Subregion    0.9944
Subregion, Country ~> Region    0.9944



Specifically, the horizontal dependencies specified there can
detect databases where the FD holds either in the union of
all probabilistic tuples, in each tuple individually, or within a
specific tuple. The first two of these cases are intolerant to
any noise in the data. The last one needs a single tuple to
be specified, which is not holistic enough to discover any
patterns. Such dependencies are ideal for schema normalization,
since they allow the tables to be decomposed and
simpler schemas to be built, but they are not appropriate for
discovering data patterns, which needs to be fault tolerant.

Monte Carlo methods have been used in probabilistic databases
before; for example, [H] uses Monte Carlo methods to
give top-k results for queries on probabilistic databases. A
more general framework for probabilistic databases is proposed
by Jampani et al. in fO where the uncertainty is
represented by parameters instead of probabilities, so that a
more generalized model of uncertainty can be represented.

In [s], Gupta and Sarawagi demonstrate how to create 
probabilistic databases that are an approximation of an in- 
formation extraction model and find that using the appropri- 
ate model of uncertainty in a database is important. If the 
model of uncertainty is too simple then interactions between 
elements of the generating model cannot be represented; if it 
is too complex then querying becomes inefficient. Similarly, 
in this paper, we are proposing the right level of uncertainty, 
but for functional dependencies. We show that using prob- 
abilistic semantics does cause a significant change in the 
confidence of the dependencies, and we show efficient algo- 
rithms that find these dependencies. 

7. CONCLUSIONS 

In this paper we defined a spectrum of dependencies for
probabilistic databases. These dependencies are logical
extensions of their deterministic counterparts. We explained
how these dependencies are related to each other. We showed
that pAFD would always have a larger confidence than the
pFD. We showed that CpAFDs are the most general of all
and that they subsume every other kind of dependency. We
then presented algorithms to assess the confidence of each of
these dependencies. We empirically verified the algorithms -
the ones for pFD and CpFD were exponential in the number
of values of the attribute, and approximately linear in the
number of tuples. The Monte Carlo algorithms for the
approximate dependencies converged fast and were accurate.
We also showed experiments with real data that demonstrated
that the Monte Carlo algorithm converges quickly.
Finally we showed how we can use these algorithms to
effectively mine dependencies from a real probabilistic database
and discover useful dependencies. We are currently exploring
the use of these dependencies in the QPIAD project [16].

Time taken: 605s

Figure 10: The dependencies discovered in DBLP data by mining
pAFDs with specificity threshold = 0.6.
9. APPENDIX

9.1 Adapting the algorithms for TI databases 

We can use the pFD algorithm for TI databases with a
slight modification. Recall the symbol @ introduced in
Section 4.3 for handling options eliminated for not matching
the pattern tableau. The symbol indicates that the option
will be ignored in further computation of the confidence,
that is, it will not conflict with any existing rule. For every
tuple in the TI database whose probability p is less than 1,
we add a second option consisting of the @ symbol
with probability 1 - p. We then have a TDI database which is
essentially equivalent to the TI database. We can now apply
the pFD algorithm on this database to assess its confidence.

The union approximation for assessing the confidence of
a pAFD from Section 4.2 can be applied directly to the TI
database to get the accurate value of the pAFD.
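The TI-to-TDI conversion above is small enough to sketch directly. The representation (a TI database as a list of `(value, probability)` pairs, a TDI tuple as a list of options) and the name `ti_to_tdi` are assumptions made for illustration.

```python
ABSENT = "@"  # placeholder option meaning "this tuple is not present"


def ti_to_tdi(ti_db):
    """Turn each TI tuple (value, p) into a TDI tuple with one or two options."""
    tdi_db = []
    for value, p in ti_db:
        options = [(value, p)]
        if p < 1.0:
            # The remaining probability mass goes to the "absent" option.
            options.append((ABSENT, 1.0 - p))
        tdi_db.append(options)
    return tdi_db


ti_example = [(("Sirius", "Red"), 1.0), (("Taurus", "Blue"), 0.7)]
tdi_example = ti_to_tdi(ti_example)
```

Tuples that are certain keep a single option, while every uncertain tuple gains an `@` option so that its option probabilities sum to 1.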

9.2 Adapting specificity for probabilistic databases

In this section we will first introduce the notion of
specificity as described by Kalavagattu and Wolf et al.
for deterministic databases and then show how it is adapted
to probabilistic databases.

Deterministic databases: The distribution of values
for the determining set is an important measure to judge the
"usefulness" of an AFD. For an AFD X ~> A, having fewer
distinct values of X means that there exist more tuples in
the database that have the same values of X. This makes
the AFD potentially more relevant. This is because if every
value of X is distinct (i.e. it is a key) then the AFD trivially
holds; however if the dependency holds in spite of X having
only a few distinct values, the AFD has a deeper semantic
meaning.

To quantify this, we first define the support of a value
a_i of an attribute set X, support(a_i), as the occurrence
frequency of value a_i in the training set. The support is
defined as support(a_i) = count(a_i)/N, where N is the number
of tuples in the training set.

Now we measure how the values of an attribute set X
are distributed using specificity. Specificity is defined as
the information entropy of the set of all possible values of
attribute set X: {a_1, a_2, ..., a_m}, normalized by the
maximal possible entropy (which is achieved when X is a
key). Thus, specificity is a value that lies between 0 and 1.

$$\mathrm{specificity}(X) = \frac{-\sum_{i=1}^{m} \mathrm{support}(a_i) \times \log_2(\mathrm{support}(a_i))}{\log_2(N)}$$



When there is only one possible value of X, this value
has the maximum support and is the least specific, so the
specificity equals 0. When there are many distinct
values of X, each having a low support and being specific, we
have a high value of specificity. When all values of X are
distinct (when X is a key), each value has the minimum
support and is most specific, and the specificity equals 1.
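As a concrete check of the definition and its two boundary cases, the deterministic specificity can be sketched in a few lines of Python. The function name `specificity` and the convention of returning 0 when there are fewer than two tuples (where log2(N) is zero or undefined) are assumptions of this sketch.

```python
import math
from collections import Counter


def specificity(values):
    """Normalized entropy of the values of X, one entry per tuple."""
    n = len(values)
    if n <= 1:
        return 0.0  # assumed convention: log2(N) is zero or undefined here
    counts = Counter(values)
    entropy = -sum((c / n) * math.log2(c / n) for c in counts.values())
    return entropy / math.log2(n)


single_value = specificity(["a", "a", "a", "a"])   # one distinct value
key_column = specificity(list(range(8)))           # all values distinct
two_values = specificity(["a", "a", "b", "b"])     # two equally frequent values
```

For the two-value column the entropy is 1 bit and log2(4) = 2, so the specificity is 0.5, between the extremes of 0 (single value) and 1 (key).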

Now we overload the concept of specificity on AFDs. The
specificity of an AFD is defined as the specificity of its
determining set, i.e. specificity(X ~> A) = specificity(X). The
lower the specificity of an AFD, the more relevant possible
answers can potentially be retrieved using the rewritten queries
generated by this AFD, and thus the higher the recall for a
given number of rewritten queries.



Intuitively, specificity increases when the number of
distinct values for a set of attributes increases. Consider two
attribute sets X and Y such that Y ⊃ X. Since Y has more
attributes than X, the number of distinct values of Y is no less
than that of X, and so specificity(Y) is no less than
specificity(X).

Probabilistic Databases: In a probabilistic database, 
the specificity of an attribute set X would be defined as 
the weighted average of the specificity of X in each pos- 
sible world. Computing this is potentially exponential in 
the number of tuples, since every possible world will have a 
different set of association rules with different support. 

In this paper, we use specificity to prune our search
space. We need to be able to compute the specificity very
quickly so that we do not spend too much time deciding
whether or not to prune the current subspace of dependencies.
As a result, we approximate the computation
of specificity by using a method similar to the union method
described in Section 4.2. We ignore the intra-tuple correlations,
and create a TI database by taking the union of all the
options in our TDI database. Computing the specificity of a
TI database is a straightforward adaptation of the deterministic
algorithm. The definition of specificity(X) remains the
same, but we redefine support(a_i) as follows, where t ranges
over the tuples in the TI database:

$$\mathrm{support}(a_i) = \frac{\sum_{t \,:\, t[X] = a_i} \mathrm{prob}(t)}{\sum_{t} \mathrm{prob}(t)}$$
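A sketch of this probability-weighted variant follows. It assumes that N in the normalizing term is the number of tuples in the union TI database (that is, the total number of options), which the text does not state explicitly; the names `union_support` and `union_specificity` and the `key` projection function are hypothetical.

```python
import math
from collections import defaultdict


def union_support(tdi_db, key):
    """Probability-weighted support of each X value over the union of options.

    key(option_value) projects an option onto the attribute set X.
    Returns (support dict, number of options in the union).
    """
    mass = defaultdict(float)
    n_options = 0
    for options in tdi_db:
        for value, p in options:
            mass[key(value)] += p
            n_options += 1
    total = sum(mass.values())
    return {x: m / total for x, m in mass.items()}, n_options


def union_specificity(tdi_db, key):
    support, n_options = union_support(tdi_db, key)
    if n_options <= 1:
        return 0.0  # assumed convention, as log2(N) is zero or undefined
    entropy = -sum(s * math.log2(s) for s in support.values() if s > 0)
    return entropy / math.log2(n_options)


distinct = [[(("a",), 0.5), (("b",), 0.5)], [(("c",), 0.5), (("d",), 0.5)]]
same = [[(("a",), 0.5), (("a",), 0.5)], [(("a",), 1.0)]]
```

Because this is a single pass over the options, it avoids enumerating possible worlds, which is the point of using the approximation for pruning.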



9.3 Adaptations to AFDMiner 

We adapt the AFDMiner algorithm from Wolf et al. [16]
to mine dependencies in our data. In this section we
describe the outline of the algorithm, and the adaptations to
probabilistic data.

The algorithm searches through the set-containment
lattice of the attributes of the relation. This lattice consists of
all possible sets of attributes. Each set of attributes has a
directed edge that points to every set that contains one attribute
more than itself. The algorithm performs a breadth-first
search through this lattice, starting with the null set of
attributes and working its way up to the set of all attributes.
For each directed edge (X, X ∪ {A}) the algorithm travels
along, the dependency (X ~> A) is tested. AFDMiner
outputs those dependencies whose confidence is larger than
the supplied confidence threshold. We adapt AFDMiner by
supplying our own confidence assessing functions for pAFD.

Pruning: Each attribute set X is tested for its specificity
value. If the value is higher than the specificity threshold,
then all outgoing edges from that set are removed from the
lattice. This lets the algorithm prune the space of dependencies
whose body is X or its superset, since they are guaranteed
to be above the specificity threshold. We use the
specificity definition from Section 9.2 for this algorithm.

AFDMiner further prunes the space of dependencies based
on redundancy. For any dependency that holds exactly, any
superset of the dependency also holds (by Armstrong's
Axioms), and hence need not be checked. So, effectively,
a list of exact FDs is maintained, and any superset of the
FDs in this list is not checked. In our case, the algorithm
to check for a pFD is significantly more expensive than the
algorithm to compute the confidence of a pAFD. As a result,
we adapt this condition by replacing FDs with very high
confidence pAFDs. For any pAFD that has a confidence
larger than the preset high-confidence threshold, we prune
the outgoing edges from that attribute set.
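The search and the two pruning rules can be sketched as a level-wise walk over the lattice. This is one reading of the description above, not the AFDMiner code itself: the function name `mine_pafds` is hypothetical, the empty determining set is skipped for simplicity, `specificity` and `confidence` are caller-supplied functions (for example the union specificity and a Monte Carlo pAFD estimator), and redundancy pruning is interpreted as skipping any candidate whose determining set strictly contains that of a recorded high-confidence pAFD with the same dependent attribute.

```python
def mine_pafds(attributes, specificity, confidence,
               spec_threshold, conf_threshold, high_conf_threshold):
    """Return a list of (lhs frozenset, rhs attribute, confidence)."""
    found = []        # dependencies above the confidence threshold
    near_exact = []   # (lhs, rhs) pairs standing in for exact FDs
    level = [frozenset([a]) for a in attributes]
    while level:
        next_level = set()
        for x in level:
            if specificity(x) > spec_threshold:
                continue  # too specific: drop every outgoing edge of x
            for a in attributes:
                if a in x:
                    continue
                next_level.add(x | {a})  # follow the edge (x, x + {a})
                if any(rhs == a and lhs < x for lhs, rhs in near_exact):
                    continue  # redundant: implied by a recorded pAFD
                c = confidence(x, a)
                if c > conf_threshold:
                    found.append((x, a, c))
                if c > high_conf_threshold:
                    near_exact.append((x, a))
        # Sorting only makes the visiting order deterministic.
        level = sorted(next_level, key=sorted)
    return found
```

Since the specificity of a superset is never lower than that of its subsets, a set reached through one surviving parent is still pruned by its own specificity test when it is too specific.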



